Important Disclaimers

The analysis presented below is intended for teaching purposes only. That is, it has not been vetted for statistical significance, market effectiveness, trading costs, predictive accuracy, or any other validation of the parameters and models selected. We highly recommend that you do not trade based on this model, as that is not the intent of this seminar.

Further, the views in this document do not represent those of the presenters Sandeep Balasubramanian and Robert Ying, nor those of the Application Development Initiatve, the Columbia Data Science Society, AQR Capital, or D.E. Shaw & Co.

That said, we hope you learn something about applied data science and timeseries forecasting!

Setup:


In [1]:
import pandas as pd
import numpy as np
from pandas.tools.plotting import autocorrelation_plot
from matplotlib import pyplot as plt
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.stattools import adfuller
import statsmodels.api as sm

plt.rcParams['figure.figsize'] = (10, 6)
%matplotlib inline

Forecasting with Time Series Data

Introduction to Time Series Data

Much of the data we interact with on a daily basis are drawn from time series. What's the temperature outside today? Is my diet working? How's the stock market doing? A time series is a set of observations of a single variable at multiple different points in time. Time series data differs from other data because these observations aren't usually independent from each other. For example, the stock price of Microsoft today is very related to the MSFT price yesterday. In contrast, if you were to sample people in this room to look for a relationship between height and GPA, you could be reasonably confident that each observation $(height, gpa)$ is independent from each other.

Modelling a Time Series

Stationarity

A process (from which a time series is sampled) is said to be stationary if the distribution of the observed values does not depend on time. One example of a nonstationary process is a stock price -- over the decades there's certainly been an upward trend, so the average price of IBM in 1995 is different than the average price of IBM in 2015. What we want is the distribution of the observed variable to be independent of time. We'd like for the mean and our variance of our observations to be constant over time.

Taking the trends out of data to make it stationary allows us to properly run regressions against other variables (otherwise we risk spurious results that conflate the time trend with the effect of the other variables). We can make data stationary by taking differences of of the observations. For example, below we have a graph of drug sales. There is clearly a trend over time--by looking at the difference in sales instead of the sales numbers themselves, we can more accurately find the relationships between it and a proposed explanatory variable.

A process that becomes stationary after differencing d times is said to be integrated with order d.

Autoregressive Model

In an autoregressive model, the response variable is regressed against previous values from the same time series. For example, we might think that the price of MSFT today is some linear combination of prices of MSFT over the last few days:

$ \begin{equation*} price_{t}=\beta_{0}+\beta_{1}price_{t-1}+\beta_{2}price_{t-2}+\beta_{3}price_{t-3}+\epsilon_{t}. \end{equation*} $

We say that a process if AR(1) if only the previous observed value is used. A process that uses the previous p values is called AR(p). A classic example of an AR(1) process is a random walk. In a random walk, a "walker" has an equal chance of stepping left or stepping right. At each time observation, the walker randomly takes a step. The position of this walker over time is therefore a direct function of where they were at the previous time and the random step direction.

**a single random walker**

**various possible random walks**

Moving Average Model

A moving average model is similar to an autoregressive model except that instead of being based on the previous observed values $y_t$, the model describes a relationship between an observation and the previous error terms. We say that a process if MA(1) if only the previous observed value is used. A process that uses the previous p values is called MA(p).

As an example, imagine looking at time series data around Moore's law. "Moore's law" is the observation that, over the history of computing hardware, the number of transistors in a dense integrated circuit has doubled approximately every two years.

Notice that the y axis has a log scale. The trend here appears linear after taking the log. Is our transistor count stationary? Not yet! However, if we difference the data once, now we're talking about the slope -- purely by inspection, the slope looks like it's pretty constant through time. Therefore we could guess that the change in log transistor count is stationary over time. We might consider using an MA(1) process to model this. This would suggest that if the change in log transistor count is lower by 2% in period 1, that error would adjust the forecast of period 2.

Analysis

Our Dataset

The New York Independent System Operator (NYISO) operates competitive wholesale markets to manage the flow of electricity across New York. We will be using this data, along with weather forecasts, to create a model that predicts electricity prices.

Definitions

NYISO: New York Independent System Operator

LBMP: Locational Based Marginal Pricing

Day Ahead Market: The Day-Ahead Energy Market lets market participants commit to buy or sell wholesale electricity one day before the operating day, to help avoid price volatility.

Realtime Market: The Real-Time Energy Market lets market participants buy and sell wholesale electricity during the course of the operating day. The Real-Time Energy Market balances the differences between day-ahead commitments and the actual real-time demand for and production of electricity.

Vintage Date: The day that the weather is predicted for is the vintage date.

Forecast Date: The date on which the above temperature is predicted is the forecast date.

Vintage Category: A prediction is made on several days leading up to the vintage date. Additionally, the temperature on the vintage date is also recorded after the date has passed. To distinguish between these, there are two types of observations: forecasts (the prediction) and actual (the realized temperature, as documented on a later date).


In [2]:
# load all the data
day_ahead_market = pd.read_csv('datasets/timeseries/day_ahead_market_lbmp.csv.gz')
real_time_market = pd.read_csv('datasets/timeseries/real_time_market_lbmp.csv.gz')
weather_forecast = pd.read_csv('datasets/timeseries/weather_forecast.csv.gz')

In [5]:
day_ahead_market.head()


Out[5]:
Time Stamp Name PTID LBMP ($/MWHr) Marginal Cost Losses ($/MWHr) Marginal Cost Congestion ($/MWH
0 01/01/2015 00:00 CAPITL 61757 33.45 1.13 -11.71
1 01/01/2015 00:00 CENTRL 61754 21.89 -0.12 -1.41
2 01/01/2015 00:00 DUNWOD 61760 31.27 1.79 -8.88
3 01/01/2015 00:00 GENESE 61753 20.84 -0.86 -1.11
4 01/01/2015 00:00 H Q 61844 20.25 -0.35 0.00

In [6]:
real_time_market.head()


Out[6]:
Time Stamp Name PTID LBMP ($/MWHr) Marginal Cost Losses ($/MWHr) Marginal Cost Congestion ($/MWHr)
0 01/01/2015 00:05:00 CAPITL 61757 32.51 1.53 -9.07
1 01/01/2015 00:05:00 CENTRL 61754 22.84 -0.15 -1.09
2 01/01/2015 00:05:00 DUNWOD 61760 30.72 1.95 -6.87
3 01/01/2015 00:05:00 GENESE 61753 21.65 -1.12 -0.86
4 01/01/2015 00:05:00 H Q 61844 21.53 -0.37 0.00

In [7]:
# convert string date column to datetime type
day_ahead_market['Time Stamp'] = pd.to_datetime(day_ahead_market['Time Stamp'], format='%m/%d/%Y %H:%M')
real_time_market['Time Stamp'] = pd.to_datetime(real_time_market['Time Stamp'], format='%m/%d/%Y %H:%M:%S')

In [8]:
# convert dates to datetimes on the weather data
weather_forecast['Forecast Date'] = pd.to_datetime(weather_forecast['Forecast Date'], format='%m/%d/%Y')
weather_forecast['Vintage Date'] = pd.to_datetime(weather_forecast['Vintage Date'], format='%m/%d/%Y')
weather_forecast['Vintage'] = weather_forecast['Vintage'].astype('category')

weather_forecast.head()


Out[8]:
Forecast Date Vintage Date Vintage Station ID Max Temp Min Temp Max Wet Bulb Min Wet Bulb
0 2015-01-01 2015-01-01 Forecast ALB 31 18 23 15
1 2015-01-01 2015-01-01 Forecast ART 32 24 28 21
2 2015-01-01 2015-01-01 Forecast BGM 27 15 21 12
3 2015-01-01 2015-01-01 Forecast BUF 30 21 26 18
4 2015-01-01 2015-01-01 Forecast ELM 32 16 24 13

In [9]:
# re-index the data by name of region and timestamp. This helps us manipulate and access the data. 
dam_time_name = day_ahead_market.set_index(['Name', 'Time Stamp'])
rtm_time_name = real_time_market.set_index(['Name', 'Time Stamp'])

dam_time_name.head()


Out[9]:
PTID LBMP ($/MWHr) Marginal Cost Losses ($/MWHr) Marginal Cost Congestion ($/MWH
Name Time Stamp
CAPITL 2015-01-01 61757 33.45 1.13 -11.71
CENTRL 2015-01-01 61754 21.89 -0.12 -1.41
DUNWOD 2015-01-01 61760 31.27 1.79 -8.88
GENESE 2015-01-01 61753 20.84 -0.86 -1.11
H Q 2015-01-01 61844 20.25 -0.35 0.00

In [10]:
# select just the NYC data
dam_nyc_lbmp = dam_time_name['LBMP ($/MWHr)']['N.Y.C.']
rtm_nyc_lbmp = rtm_time_name['LBMP ($/MWHr)']['N.Y.C.']

In [11]:
plt.figure(figsize=(10,8))
dam_nyc_lbmp.plot(title='NYC Day Ahead LBMP 2015')


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x12075270>

In [12]:
plt.figure(figsize=(10,8))
rtm_nyc_lbmp.plot(title='NYC Realtime LBMP 2015')


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x22897c90>

In [13]:
# Real Time Data Peek
rtm_nyc_lbmp.head()


Out[13]:
Time Stamp
2015-01-01 00:05:00    30.79
2015-01-01 00:10:00    30.75
2015-01-01 00:15:00    31.32
2015-01-01 00:20:00    30.79
2015-01-01 00:25:00    31.26
Name: LBMP ($/MWHr), dtype: float64

In [14]:
# Day Ahead Data Peek
dam_nyc_lbmp.head()


Out[14]:
Time Stamp
2015-01-01 00:00:00    31.45
2015-01-01 01:00:00    26.49
2015-01-01 02:00:00    24.54
2015-01-01 03:00:00    23.64
2015-01-01 04:00:00    24.10
Name: LBMP ($/MWHr), dtype: float64

Now there's a slight problem--the timestamps on the realtime data and day ahead data don't actually line up. More specifically, the realtime data has observations every 5 minutes while the day ahead data has observations every hour. To fix this, we need to align the two series of data.


In [15]:
aligned_dam, aligned_rtm = rtm_nyc_lbmp.align(dam_nyc_lbmp, join='inner')

In [16]:
# Remove Duplicates

no_dup_al_dam = aligned_dam[~aligned_dam.index.duplicated(keep='first')]
no_dup_al_rtm = aligned_rtm[~aligned_dam.index.duplicated(keep='first')]

no_dup_al_dam.name = 'dam_lbmp'
no_dup_al_rtm.name = 'rtm_lbmp'

Next step is to wrap this up nicely in a data frame. However, just constructing the data frame the default way will give us a very wide dataframe-- to make it into a tall one we'll need to transpose.


In [17]:
# this dataframe is too wide
pd.DataFrame([no_dup_al_dam, no_dup_al_rtm])


Out[17]:
Time Stamp 2015-01-01 01:00:00 2015-01-01 02:00:00 2015-01-01 03:00:00 2015-01-01 04:00:00 2015-01-01 05:00:00 2015-01-01 06:00:00 2015-01-01 07:00:00 2015-01-01 08:00:00 2015-01-01 09:00:00 2015-01-01 10:00:00 ... 2015-12-31 14:00:00 2015-12-31 15:00:00 2015-12-31 16:00:00 2015-12-31 17:00:00 2015-12-31 18:00:00 2015-12-31 19:00:00 2015-12-31 20:00:00 2015-12-31 21:00:00 2015-12-31 22:00:00 2015-12-31 23:00:00
dam_lbmp 32.94 62.32 37.20 29.75 44.51 25.81 28.04 22.07 17.94 9.05 ... 22.00 20.41 14.56 19.25 27.65 57.16 17.77 22.02 14.00 16.14
rtm_lbmp 26.49 24.54 23.64 24.10 26.32 31.48 30.57 31.34 33.21 29.85 ... 24.88 25.54 31.94 41.37 32.24 30.76 27.99 24.12 22.61 19.84

2 rows × 8758 columns


In [18]:
# transpose dataframe to make it tall
dam_rtm_df = pd.DataFrame([no_dup_al_dam, no_dup_al_rtm]).transpose()
dam_rtm_df.head()


Out[18]:
dam_lbmp rtm_lbmp
Time Stamp
2015-01-01 01:00:00 32.94 26.49
2015-01-01 02:00:00 62.32 24.54
2015-01-01 03:00:00 37.20 23.64
2015-01-01 04:00:00 29.75 24.10
2015-01-01 05:00:00 44.51 26.32

Now that we have our pricing data for the NYC region, we need to get the weather data ready. The weather data comes from a different data source, and unfortunately it's not split into the same exact regions as the pricing data. To remedy this, we'll pick two weather stations nearby -- the ones located at JFK airport and LGA airport -- and average the temperatures together.


In [19]:
# a reminder about what the weather data looks like
weather_forecast.head()


Out[19]:
Forecast Date Vintage Date Vintage Station ID Max Temp Min Temp Max Wet Bulb Min Wet Bulb
0 2015-01-01 2015-01-01 Forecast ALB 31 18 23 15
1 2015-01-01 2015-01-01 Forecast ART 32 24 28 21
2 2015-01-01 2015-01-01 Forecast BGM 27 15 21 12
3 2015-01-01 2015-01-01 Forecast BUF 30 21 26 18
4 2015-01-01 2015-01-01 Forecast ELM 32 16 24 13

In [21]:
# get every temperature from the LGA and JFK stations
lga_and_jfk_indexed = weather_forecast[(weather_forecast['Station ID'] == 'LGA') |
                                       (weather_forecast['Station ID'] == 'JFK')].set_index(['Forecast Date',
                                                                                             'Vintage Date',
                                                                                             'Vintage',
                                                                                             'Station ID'])
lga_and_jfk_indexed.head()


Out[21]:
Max Temp Min Temp Max Wet Bulb Min Wet Bulb
Forecast Date Vintage Date Vintage Station ID
2015-01-01 2015-01-01 Forecast JFK 39 27 31 22
LGA 37 26 29 21
2015-01-02 Forecast JFK 42 31 34 27
LGA 41 32 32 27
2014-12-29 Actual JFK 46 36 37 30

In [22]:
# average together LGA and JFK temperatures (you'll notice that it's no longer in our data!)
mean_nyc_indexed = lga_and_jfk_indexed.mean(level=[0,1,2])
mean_nyc_indexed.head()


Out[22]:
Max Temp Min Temp Max Wet Bulb Min Wet Bulb
Forecast Date Vintage Date Vintage
2015-01-01 2014-12-29 Actual 44.5 36.0 36.5 30.0
2014-12-30 Actual 35.5 28.5 30.0 23.0
2014-12-31 Actual 32.5 27.5 25.5 22.5
2015-01-01 Forecast 38.0 26.5 30.0 21.5
2015-01-02 Forecast 41.5 31.5 33.0 27.0

We need to massage our data a bit more. First step is to unindex, which flattens our dataframe. Next, we pick out only the rows with the Vintage being 'Actual'. We've got a few different kinds of dates floating around and it's important to stay consistent. Our end goal is to have our data like:

Time stamp: today

Temperature: the actual temperature today (x)

Day ahead price: today's price for electricity tomorrow (x)

Realtime price: tomorrrow's electricity price observed tomorrow (y)


In [23]:
mean_nyc = mean_nyc_indexed.reset_index()
actual_temp_df = mean_nyc[mean_nyc['Vintage'] == 'Actual'] \
    .groupby(['Vintage Date']).first() \
    .rename(columns=lambda x: 'Actual ' + x) # prepend the word Actual to column names

dam_rtm_act_df = dam_rtm_df.join(actual_temp_df, how='left').fillna(method='ffill').dropna()

dam_rtm_act_df.head()


Out[23]:
dam_lbmp rtm_lbmp Actual Forecast Date Actual Vintage Actual Max Temp Actual Min Temp Actual Max Wet Bulb Actual Min Wet Bulb
Time Stamp
2015-01-02 00:00:00 16.87 28.77 2015-01-03 Actual 43 35.5 33.5 30
2015-01-02 01:00:00 26.47 27.61 2015-01-03 Actual 43 35.5 33.5 30
2015-01-02 02:00:00 15.07 24.41 2015-01-03 Actual 43 35.5 33.5 30
2015-01-02 03:00:00 14.24 24.01 2015-01-03 Actual 43 35.5 33.5 30
2015-01-02 04:00:00 12.95 24.03 2015-01-03 Actual 43 35.5 33.5 30

Next, observe that there is a different day ahead price every single hour, but only a single temperature estimate per day. We'll do something similar to the frequency mismatch we saw with day ahead/realtime data -- resample at the lower frequency and average the data in those lower frequency buckets.


In [24]:
# Resample to daily because temperature and prices don't have the same frequency
daily_df = dam_rtm_act_df.resample('D', how='mean')
daily_df.head()


Out[24]:
dam_lbmp rtm_lbmp Actual Max Temp Actual Min Temp Actual Max Wet Bulb Actual Min Wet Bulb
Time Stamp
2015-01-02 47.749583 32.220833 43.0 35.5 33.5 30.0
2015-01-03 33.246667 27.447917 43.0 31.0 42.5 27.5
2015-01-04 29.820417 23.439583 55.0 42.0 53.0 41.0
2015-01-05 66.288333 42.450000 51.0 22.5 43.0 17.5
2015-01-06 64.860833 60.502917 22.5 19.0 20.0 15.0

The next question is whether to use the max temp or min temp--let's take a look at the data to see if one or the other makes a difference.


In [26]:
plt.figure(figsize=(14,10))
plt.plot_date(daily_df.index, daily_df['rtm_lbmp'], '-', label='RTM LBMP')
plt.plot_date(daily_df.index, daily_df['dam_lbmp'], '-', label='DAM LBMP')
plt.plot_date(daily_df.index, daily_df['Actual Min Temp'], '-', label='Min Temp')
plt.plot_date(daily_df.index, daily_df['Actual Max Temp'], '-', label='Max Temp')
plt.legend()


Out[26]:
<matplotlib.legend.Legend at 0x2336fcd0>

Min and max temperature seem to be so highly correlated with each other that it most likely won't matter one way or another which we used.

Model Estimation

Now that our data is in the format we want, we can finally put into practice some of those theoretical concepts we discussed at the beginning. Our tool of choice here is a type of regression called ARIMA. It's a combination of 3 types of models -- AutoRegressive (a combination of the previous values), Integrated (differencing the data), Moving Average (smoothing the data). It's typically represented as:

arima(# of AR terms, # of differences, # of smoothing terms) = $ arima(p,d,q) $.

So how can we go about picking our parameters p, i, and q? There's several approaches of varying sophistication. A common approach is to use the Box-Jenkins Method for parameter selection. First, you pick parameter d (does the data need differencing to make it stationary?) You can do this using a Dickey-Fuller test. Then, you can use autocorrelation function (ACF) and partial autocorrelation function (PACF) to identify the AR parameter (p) and the MA parameter (q).

For our example, we'll just pick ARIMA(0,0,1) for it's simplicity (this is therefore just an MA(1) process). At the end we discuss these tools / the procedure in more detail.

Before fitting the model, we'll be splitting our data into two chunks -- one to train the model on and one to test it. This gives us a chance to see how the model would do in the "real world" on data it hasn't seen before.


In [27]:
exog_data = np.array([daily_df['Actual Max Temp'].values, daily_df['dam_lbmp'].values])

# fit the model on a portion of the data
k = 250
m = ARIMA(daily_df['rtm_lbmp'].values[0:k], [0,0,1], exog=np.transpose(exog_data[:,0:k]), dates=daily_df.index.values[0:k])
results = m.fit(trend='nc', disp=True)

# what does our model predict?
predicted_prices = results.predict(10, 364, exog=np.transpose(exog_data), dynamic=True)

# let's see it graphically
plt.figure(figsize=(14, 10))
plt.plot(predicted_prices, label='prediction')
plt.plot(daily_df['rtm_lbmp'].values, label='actual RTM')
#plt.plot(daily_df['dam_lbmp'].values, label='actual DAM')

plt.legend()


Out[27]:
<matplotlib.legend.Legend at 0x23211250>

In [28]:
results.summary()


Out[28]:
ARMA Model Results
Dep. Variable: y No. Observations: 250
Model: ARMA(0, 1) Log Likelihood -1029.430
Method: css-mle S.D. of innovations 14.855
Date: Mon, 29 Feb 2016 AIC 2066.860
Time: 17:07:58 BIC 2080.946
Sample: 01-02-2015 HQIC 2072.529
- 09-08-2015
coef std err z P>|z| [95.0% Conf. Int.]
x1 0.0571 0.027 2.126 0.035 0.004 0.110
x2 0.8586 0.033 25.844 0.000 0.793 0.924
ma.L1.y 0.4423 0.061 7.239 0.000 0.323 0.562
-2.2608 +0.0000j 2.2608 0.5000
Roots
Real Imaginary Modulus Frequency
MA.1

Analysis

Our model seems to work well from the graph of the the forecast. Looking at the coefficients from the ARIMA model, we can see that that increasing temperatures and increasing day ahead prices are both associated with higher realtime prices the next day. Just by looking at the coefficients, one can see that our forecast is extremely similar to the day ahead values, with a small adjustment based on temperature and another adjustment based on the moving average term.


In [29]:
plt.figure(figsize=(14, 10))
plt.plot(predicted_prices, label='prediction')
plt.plot(daily_df['rtm_lbmp'].values, label='actual RTM')
plt.plot(daily_df['dam_lbmp'].values, label='actual DAM')

plt.legend()


Out[29]:
<matplotlib.legend.Legend at 0x2305cc90>

How Did We Do?

Our data and model helps us predict the realtime electricity cost the day before. If you ran a company with significant power demands, you might be able to use such a model to decide whether or not to buy electricity in advance or on demand. Let's see how it would work!

  1. Learn about Autoregressive Integrated Moving Average models
  2. Apply it to a relevant dataset
  3. ????
  4. PROFIT!!!

In [30]:
# Profit Check
len(predicted_prices)
len(daily_df['rtm_lbmp'].values[10:])

print("--- Trading Log ---")

i = 251
PnL = np.zeros(100)


while i < 351:
    if (predicted_prices[i] < daily_df['dam_lbmp'].values[i]) and (daily_df['rtm_lbmp'].values[i+1] < daily_df['dam_lbmp'].values[i]):
        
        # if our model says the DAM is overpriced, then don't pre-buy and buy at the realtime price
        
        print("Buy RTM, +", daily_df['dam_lbmp'].values[i] - daily_df['rtm_lbmp'].values[i+1])
        PnL[i-251] = daily_df['dam_lbmp'].values[i] - daily_df['rtm_lbmp'].values[i+1]
        
    elif (predicted_prices[i] > daily_df['dam_lbmp'].values[i]) and (daily_df['rtm_lbmp'].values[i+1] > daily_df['dam_lbmp'].values[i]):
        
        # if our model says the DAM is underpriced, pre-buy the electricity so you don't have to pay realtime price 
        
        print("Buy DAM, +", daily_df['rtm_lbmp'].values[i+1] - daily_df['dam_lbmp'].values[i] )
        PnL[i-251] = daily_df['rtm_lbmp'].values[i+1] - daily_df['dam_lbmp'].values[i]
        
    else:
        
        # if we were wrong, we lose money :(
        
        print("Lose $$, -", max(daily_df['rtm_lbmp'].values[i+1] - daily_df['dam_lbmp'].values[i],daily_df['dam_lbmp'].values[i] - daily_df['rtm_lbmp'].values[i+1]))
        PnL[i-251] = min(daily_df['rtm_lbmp'].values[i+1] - daily_df['dam_lbmp'].values[i],daily_df['dam_lbmp'].values[i] - daily_df['rtm_lbmp'].values[i+1])
    i = i+1


--- Trading Log ---
('Buy DAM, +', 4.7812499999999964)
('Lose $$, -', 2.4829166666666715)
('Buy RTM, +', 17.576249999999991)
('Buy DAM, +', 0.96041666666667069)
('Buy DAM, +', 18.397499999999997)
('Buy DAM, +', 4.6199999999999939)
('Buy RTM, +', 4.8508333333333411)
('Lose $$, -', 0.82583333333333542)
('Buy RTM, +', 13.429583333333326)
('Buy RTM, +', 25.946249999999988)
('Buy RTM, +', 14.424166666666668)
('Buy RTM, +', 9.9324999999999974)
('Buy RTM, +', 3.8166666666666558)
('Buy RTM, +', 1.2629166666666691)
('Buy DAM, +', 6.7224999999999966)
('Lose $$, -', 0.052083333333332149)
('Buy DAM, +', 4.0216666666666612)
('Buy DAM, +', 13.397499999999994)
('Buy DAM, +', 10.117500000000003)
('Buy RTM, +', 42.542500000000004)
('Buy RTM, +', 8.0600000000000058)
('Buy DAM, +', 5.0695833333333304)
('Lose $$, -', 2.4054166666666674)
('Buy RTM, +', 3.8545833333333306)
('Buy DAM, +', 2.8687499999999986)
('Buy RTM, +', 3.5899999999999999)
('Buy DAM, +', 5.9754166666666606)
('Buy DAM, +', 8.5354166666666629)
('Lose $$, -', 1.1391666666666609)
('Lose $$, -', 3.2083333333333428)
('Buy DAM, +', 1.1404166666666633)
('Buy DAM, +', 14.305416666666668)
('Buy DAM, +', 6.8320833333333297)
('Buy DAM, +', 4.9129166666666713)
('Buy DAM, +', 1.5162499999999994)
('Buy DAM, +', 4.5512499999999889)
('Buy DAM, +', 6.9862500000000054)
('Lose $$, -', 5.7316666666666585)
('Lose $$, -', 6.875833333333329)
('Lose $$, -', 5.0062500000000014)
('Buy RTM, +', 2.0275000000000034)
('Buy RTM, +', 8.3766666666666616)
('Buy DAM, +', 2.2762499999999939)
('Lose $$, -', 0.70875000000000199)
('Buy DAM, +', 1.1291666666666629)
('Buy DAM, +', 6.8625000000000007)
('Buy DAM, +', 4.0633333333333432)
('Lose $$, -', 1.0916666666666721)
('Buy DAM, +', 2.980833333333333)
('Buy DAM, +', 9.6670833333333412)
('Buy RTM, +', 7.178333333333331)
('Buy RTM, +', 5.9908333333333381)
('Buy DAM, +', 9.9066666666666627)
('Buy RTM, +', 14.988749999999992)
('Buy DAM, +', 4.6975000000000016)
('Buy DAM, +', 1.5120833333333401)
('Buy DAM, +', 5.6191666666666755)
('Buy DAM, +', 8.5974999999999948)
('Buy DAM, +', 6.176666666666673)
('Buy DAM, +', 10.897083333333331)
('Lose $$, -', 1.960833333333337)
('Buy RTM, +', 4.7395833333333357)
('Buy RTM, +', 3.5533333333333275)
('Buy DAM, +', 10.574999999999999)
('Buy DAM, +', 12.364166666666662)
('Buy DAM, +', 2.3991666666666713)
('Buy DAM, +', 11.910416666666665)
('Lose $$, -', 0.45500000000000185)
('Buy DAM, +', 5.4337500000000034)
('Lose $$, -', 1.5620833333333337)
('Lose $$, -', 0.65583333333334082)
('Buy DAM, +', 11.046250000000001)
('Buy DAM, +', 9.148750000000005)
('Buy DAM, +', 9.275416666666672)
('Lose $$, -', 2.66458333333334)
('Buy RTM, +', 0.58750000000000213)
('Buy RTM, +', 13.678750000000008)
('Buy DAM, +', 4.9604166666666689)
('Buy DAM, +', 8.8345833333333381)
('Buy RTM, +', 7.3291666666666622)
('Lose $$, -', 1.1929166666666688)
('Lose $$, -', 2.8512499999999967)
('Buy DAM, +', 1.0904166666666661)
('Buy RTM, +', 3.7537499999999859)
('Buy DAM, +', 5.2866666666666653)
('Buy DAM, +', 3.8229166666666714)
('Buy RTM, +', 19.479166666666671)
('Lose $$, -', 0.14541666666666586)
('Buy RTM, +', 44.968749999999986)
('Lose $$, -', 4.2724999999999937)
('Buy DAM, +', 3.6150000000000055)
('Buy DAM, +', 1.1541666666666615)
('Lose $$, -', 1.9920833333333334)
('Lose $$, -', 6.7012499999999982)
('Buy DAM, +', 1.1079166666666662)
('Buy DAM, +', 0.061666666666667425)
('Buy RTM, +', 15.491250000000004)
('Buy DAM, +', 4.5175000000000018)
('Buy DAM, +', 5.1470833333333346)
('Buy DAM, +', 9.932916666666662)

In [31]:
cumPnL = np.cumsum(PnL)
plt.figure(figsize=(10, 8))
plt.plot(cumPnL, label='PnL')
plt.legend()


Out[31]:
<matplotlib.legend.Legend at 0x22f3f970>

It looks like the model works because it predicts every so slightly closer to the true RTM than the DAM!


In [32]:
dam_adj = daily_df['rtm_lbmp'].values[10:]-daily_df['dam_lbmp'].values[:-10]
mod_adj = daily_df['rtm_lbmp'].values[10:]-predicted_prices[:-1]
plt.figure(figsize=(10, 8))
plt.plot(dam_adj, label='DAM error')
plt.plot(mod_adj, label='Model error')
plt.legend()


Out[32]:
<matplotlib.legend.Legend at 0x23c04b30>

Problems

There are MANY potential problems with this approach that need to be mentioned. Though the results seem okay, this is a good example of a model that might fit but not be meaningful.

Seasonality: Data with simple trends can be differenced once to achieve stationarity. However, you will often find data that varies by season. In our case, weather data and electricity usage are strong candidates for having some sort of seasonal trend. To deal with this, we would need to grab several years of data and incoporate the yearly differences (between Jan 1 2015 and Jan 1 2016, for example) in addition. Seasonal ARIMA (SARIMA) is a tool for this purpose.

Parameter Estimation: It could easily take a whole workshop to discuss choosing model form. The parameters chosen ARIMA(0,0,1) may be inappropriate. When doing ARIMA, we want the y variable (RTM in our case) to be stationary (that is, the distribution doesn't change over time). This assumption of stationary may seem suspect, especially given the winter activity (discussed in seasonality section below). You should use of the Dickey Fuller test to make sure stationarity is appropriate.

Additionally, the AR(p) and MA(q) terms need to be determined by using ACF and PACF. The ACF tells you how much an observation at time i is related to observations at times i-1, i-2, etc. The PACF is similar to an ACF but it controls for the values of the time series at all shorter lags (how much does the 3rd lag tell me beyond what the 1st and 2nd ones do?). See it in action below.


In [33]:
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(np.diff(daily_df['rtm_lbmp'].values), lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(np.diff(daily_df['rtm_lbmp'].values), lags=40, ax=ax2)


c:\python27\lib\site-packages\matplotlib\collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):

Interpreting these plots is a bit of an art as well as a science. There are many different rules that you can use to help you determine the "correct" parameters for an ARIMA process, though it's important to note that there isn't always a right answer; in fact, it's not always possible to model some data using ARIMA.


In [ ]: